Technique for implementing a security algorithm

ABSTRACT

Performing a hash algorithm in a processor architecture to alleviate performance bottlenecks and improve overall algorithm performance. In one embodiment of the invention, the hash algorithm is pipelined within the processor architecture.

FIELD

Embodiments of the invention relate to network security algorithms. More particularly, embodiments of the invention relate to the performance of hash algorithms, including, for example, the secure hash algorithms (“SHA”) SHA-1, SHA-128, SHA-192, and SHA-256, as well as message digest (MD) algorithms, such as the MD5 algorithm, within network processor architectures.

BACKGROUND

Security algorithms may be used to encode or decode data transmitted or received in a computer network through techniques such as compression.

In some instances, the network processor may compress or decompress the data in order to help secure the integrity and/or privacy of the information being transmitted or received within the data. The data can be compressed or decompressed by performing a variety of different algorithms, such as hash algorithms.

One such hash algorithm is the secure hash algorithm 1 (“SHA-1”) security algorithm. The SHA-1 algorithm can be a laborious and resource-consuming task for many network processors, however, as it requires numerous mathematically intensive computations within a main recursive compression loop. Moreover, the main compression loop may be performed numerous times in order to compress or decompress a particular amount of data.

In general, hash algorithms are algorithms that take a large group of data and reduce it to a smaller representation of that data. Hash algorithms may be used in such applications as security algorithms to protect data from corruption or detection. The SHA-1 algorithm, for example, may reduce groups of 64 bytes of data to 20 bytes of data. Other hash algorithms, such as the SHA-128, SHA-129, and message digest 5 (MD5) algorithms, may also be used to reduce large groups of data to smaller ones. Hash algorithms, in general, can be very taxing on computer system performance because they require intensive mathematical computations in a recursive main compression loop that is performed iteratively to compress or decompress groups of data.
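By way of illustration only, the size reduction described above can be observed with a general-purpose software library. The sketch below uses the SHA-1 implementation in Python's standard `hashlib` module, which is not part of the described embodiments; the input bytes are arbitrary and chosen only for the example.

```python
import hashlib

# An arbitrary 64-byte (512-bit) group of data; the contents are illustrative.
block = bytes(range(64))

# SHA-1 reduces the input to a fixed 20-byte (160-bit) representation.
digest = hashlib.sha1(block).digest()

print(len(block), "bytes in,", len(digest), "bytes out")
```

The output length is 20 bytes regardless of the input length, which is the reduction property relied upon in the description.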

Adding to the difficulty in performing the hash algorithms at high frequencies are the latencies, or “bottlenecks,” that can occur between operations of the algorithm due to data dependencies between the operations. When performing the algorithm on typical processor architectures, the operations must be performed in substantially sequential fashion because typical processor architectures perform the operations of each iteration of the main compression loop on the same logic units or group of logic units. As a result, if dependencies exist between the iterations of the main loop, a bottleneck forms while unexecuted iterations are delayed to allow the hardware to finish processing the earlier operations.

These bottlenecks can be somewhat alleviated by taking advantage of instruction-level parallelism (“ILP”) of instructions within the algorithm and performing them in parallel execution units.

Typical prior art parallel execution unit architectures used to perform hash algorithms have had marginal success. This is true, in part, because the instruction and sub-instruction operations associated with typical hash algorithms rarely have the necessary ILP to allow true independent parallel execution. Furthermore, earlier architectures do not typically schedule operations in such a way as to minimize the critical path associated with long dependency chains among various operations.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

FIG. 1 illustrates a processor architecture in which one embodiment of the invention may be used.

FIG. 2 illustrates a system in which one embodiment of the invention may be used.

FIG. 3 illustrates a technique for performing a hash algorithm in a pipelined architecture according to one embodiment of the invention.

FIG. 4 illustrates a method for performing a hash algorithm according to one embodiment of the invention.

DETAILED DESCRIPTION

Embodiments of the invention described herein relate to network security algorithms. More particularly, embodiments of the invention described herein relate to a technique that may be used to improve the performance of hash algorithms without incurring significant cost.

At least one embodiment of the invention may be used to improve the performance of hash algorithms by performing various operations associated with the algorithm concurrently, or “pipelining” the operations, within a microprocessor. Pipelining the SHA-1 algorithm, for example, involves performing various iterations of the main compression loop at different stages of a microprocessor concurrently in one embodiment of the invention. The extent to which iterative operations of the algorithm may be performed concurrently depends, at least in part, on the instruction level parallelism (ILP) of the microprocessor in which the algorithm is executed.

FIG. 1 illustrates a processor architecture in which one embodiment of the invention may be used to perform a hash algorithm while reducing performance degradation, or “bottlenecks,” within the processor. In the embodiment of the invention illustrated in FIG. 1, the pipeline architecture of the encryption portion 105 of the network processor 100 may operate at frequencies at or near the operating frequency of the network processor itself or, alternatively, at an operating frequency equal to that of one or more logic circuits within the network processor.

FIG. 2 illustrates a computer network in which an embodiment of the invention may be used. The host computer 225 may communicate with a client computer 210 or another host computer 215 by driving or receiving data upon the bus 220. The data is received and transmitted across a network by a program running on a network processor embedded within the network computers. At least one embodiment of the invention 205 may be implemented within the host computer in order to compress the data that is sent to the client computer(s).

For example, in one embodiment of the invention, a hash algorithm operates on blocks of 512 bits of data at a time. In such an embodiment, the algorithm's compression loop receives a 160-bit state input along with 512 bits of data and produces a 160-bit state output. It does so by producing an intermediate state output that is a function of the state input and the data input, and adding this intermediate output to the state input in order to produce the 160-bit final state output.

Various processor architectures having various operating frequencies may be used to facilitate the expedient compression or decompression of the data using a hash algorithm. In one embodiment of the invention, hash algorithms, such as the SHA-1, SHA-128, SHA-192, SHA-256, and the message digest 5 (MD5) algorithms, are performed on pipelined processor architectures operating at frequencies up to, and in excess of, 1.4 GHz.

For at least one embodiment of the invention, the SHA-1 algorithm can be performed with fewer performance bottlenecks and at greater operating frequencies by taking advantage of the recursive nature of the inner loop of the algorithm, which performs multiple iterations of the equation:

$$\mathrm{TEMP} = R^{5}(A) + F_t(B, C, D) + E + X_t + K_t$$

Where:

-   $E = D$;
-   $D = C$;
-   $C = R^{30}(B)$;
-   $B = A$;
-   $A = \mathrm{TEMP}$;

A, B, C, D, and E are chaining variables that change state with each iteration of the loop, such that the function $F_t$ produces a new result for each iteration. The function $F_t$ represents the logical operation or operations that the SHA-1 algorithm applies to B, C, and D in iteration t. The function $R^{x}(v)$ is a left rotation of the value v by x bit positions. The rotation function may be implemented using various logic devices, including a shifter. The equation is executed t times in order to process the $X_t$ units of data. Furthermore, the constant, K, may change periodically.
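By way of illustration only, the left rotation used in the equation can be modeled in software as a pair of shifts combined with a logical OR. The sketch below assumes 32-bit chaining variables, as in SHA-1; the name `rotl32` is hypothetical and chosen for the example.

```python
MASK32 = 0xFFFFFFFF  # chaining variables are assumed to be 32 bits wide


def rotl32(v, x):
    """Rotate the 32-bit value v left by x bit positions (0 < x < 32)."""
    return ((v << x) | (v >> (32 - x))) & MASK32


# Bits shifted out of the top re-enter at the bottom.
print(hex(rotl32(0x80000001, 1)))  # 0x3
print(hex(rotl32(0x00000001, 30)))  # 0x40000000
```

In hardware, a rotation by a fixed amount such as 5 or 30 bits may reduce to wiring, which is one reason the rotation need not dominate the delay of a stage.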

For example, in one embodiment of the invention, the loop executes 80 times to process 512 bits of data, and the constant, K, changes every 20 iterations of the loop.
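By way of illustration only, the 80-iteration loop and the addition of the state input to the intermediate state may be modeled sequentially in software as sketched below. The round functions, the four constants, the initial state, the expansion of the 16 input words into 80 words, and the padding are taken from the published SHA-1 standard rather than from this description, and all function names are hypothetical. The sketch is a reference model for checking results, not the pipelined embodiment.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
# One constant per group of 20 iterations (values from the SHA-1 standard).
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)
# Initial state of the chaining variables A..E (from the SHA-1 standard).
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def rotl(v, x):
    return ((v << x) | (v >> (32 - x))) & MASK


def f(t, b, c, d):
    """Round-dependent function F_t as defined by the SHA-1 standard."""
    if t < 20:
        return (b & c) | (~b & d)
    if t < 40 or t >= 60:
        return b ^ c ^ d
    return (b & c) | (b & d) | (c & d)


def compress(state, block):
    """Process one 512-bit block: 160-bit state in, 160-bit state out."""
    x = list(struct.unpack(">16I", block))
    for t in range(16, 80):  # word expansion, per the SHA-1 standard
        x.append(rotl(x[t - 3] ^ x[t - 8] ^ x[t - 14] ^ x[t - 16], 1))
    a, b, c, d, e = state
    for t in range(80):
        temp = (rotl(a, 5) + f(t, b, c, d) + e + x[t] + K[t // 20]) & MASK
        e, d, c, b, a = d, c, rotl(b, 30), a, temp
    # Add the state input to the intermediate state to form the output.
    return tuple((s + v) & MASK for s, v in zip(state, (a, b, c, d, e)))


def sha1_single_block(msg):
    """Hash a message short enough (at most 55 bytes) to fit in one block."""
    pad = b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">5I", *compress(IV, msg + pad))


print(sha1_single_block(b"abc") == hashlib.sha1(b"abc").digest())
```

The final comparison against Python's `hashlib` is included only as a sanity check of the model.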

The inner compression loop that executes the above equation may be performed much faster by performing the loop in a pipelined processor architecture, wherein each iteration of the loop is performed by a dedicated pipeline stage or stages. For example, FIG. 3 illustrates some of the pipeline stages and operations that are performed to execute the above equation 80 times in order to process 512 bits of data.

In pipeline stage 1 305, the constant, K, is added to the first data word 301, X₁, and the chaining variables, B, C, D, and E, are set to their initial state 302. In pipeline stage 2 310, the result of stage 1 is added to the chaining variable E₀ 306, chaining variables, B, C, and D, are applied to the function, F 307, and the constant, K, is added to the data word, X₂ 308. In pipeline stage 3 320, the result of the function F is added 311 to the sum of E and the result from stage 1, the constant, K, is added to the data word X₃ 313, the stage 2 result is added to the chaining variable D 314, and the function is applied to chaining variables A, B, and C 315 after they have been rotated to the left by 30 bits.
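By way of illustration only, the staging described above relies on the fact that the additions in the equation may be regrouped: the sum of K and the data word does not depend on the chaining variables and may therefore be formed one step ahead of the iteration that uses it. The sketch below is a sequential software model, not the hardware of FIG. 3. It compares a direct evaluation of the equation with a regrouped evaluation, using arbitrary data words and the constants and round functions of the published SHA-1 standard; the names `direct` and `staged` are hypothetical.

```python
MASK = 0xFFFFFFFF
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)


def rotl(v, x):
    return ((v << x) | (v >> (32 - x))) & MASK


def f(t, b, c, d):
    if t < 20:
        return (b & c) | (~b & d)
    if t < 40 or t >= 60:
        return b ^ c ^ d
    return (b & c) | (b & d) | (c & d)


def direct(state, words):
    """Evaluate TEMP as a single five-term sum in every iteration."""
    a, b, c, d, e = state
    for t, x in enumerate(words):
        temp = (rotl(a, 5) + f(t, b, c, d) + e + x + K[t // 20]) & MASK
        e, d, c, b, a = d, c, rotl(b, 30), a, temp
    return a, b, c, d, e


def staged(state, words):
    """Form K + X one step early, then add E, then F, then the rotated A."""
    a, b, c, d, e = state
    xk = (words[0] + K[0]) & MASK  # first step: K + X for iteration 0
    for t in range(len(words)):
        with_e = (xk + e) & MASK  # add the chaining variable E
        ft = f(t, b, c, d)  # apply F to B, C, and D
        if t + 1 < len(words):
            # Overlapped work: K + X for the next iteration.
            xk = (words[t + 1] + K[(t + 1) // 20]) & MASK
        temp = (rotl(a, 5) + with_e + ft) & MASK
        e, d, c, b, a = d, c, rotl(b, 30), a, temp
    return a, b, c, d, e


# Arbitrary illustrative inputs: 80 data words and an initial state.
words = [(i * 0x9E3779B1) & MASK for i in range(80)]
state = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
print(direct(state, words) == staged(state, words))
```

Because addition modulo 2³² is associative and commutative, both groupings yield the same chaining variables, so only the final addition of the rotated A need wait for the preceding iteration.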

Because the inner loop is executed 80 times in order to process 512 bits, the pipeline architecture illustrated in FIG. 3 may require 83 stages to perform all 80 iterations of the equation. Furthermore, the SHA-1 algorithm may require that the initial states of the chaining variables be added to the final states of the chaining variables, which requires an extra 5 pipeline stages, for a total of 88 pipeline stages to completely process the 512 bits of data.

In other embodiments, fewer pipeline stages may be used to process the 512 bits by performing several operations at each pipeline stage. Furthermore, fewer or more pipeline stages may be used in other embodiments of the invention depending upon the hash algorithm to be performed. Which pipeline stages perform which operations associated with each iteration of the equation is largely determined by scheduling logic that attempts to schedule these operations according to their data dependencies. In this manner, the scheduler can use the pipeline stages in the most efficient manner possible, thereby preventing lengthy bottlenecks.

For example, if an operation performed in iteration 2 of the inner loop is dependent upon data from an operation performed in iteration 1, the second pipeline stage may remain partially idle for a time until the data from iteration 1 (performed in the first pipeline stage) is available. However, if these operations are performed in parallel with operations having similar data dependencies and therefore similar delay, the scheduler can perform these operations at one time at different pipeline stages. As a result, bottlenecks and delays incurred by data dependencies are minimized, allowing the pipeline architecture to operate at higher frequencies limited only by the operating frequency of the processor architecture or hardware therein, such as the adder circuits.

FIG. 4 illustrates a method for carrying out the invention according to one embodiment. The main loop equation of the SHA-1 algorithm must be decoded into separate operations corresponding to the operations that must be performed at each iteration of the loop at operation 405. Operations corresponding to each iteration of the loop may then be scheduled for execution within a particular pipeline stage at operation 410.

The choice of the stage in which to perform the operation(s) depends, at least in part, on the dependencies between the operations. In order to perform the algorithm at the highest performance level possible, the critical paths (operations having lengthy dependency chains) must be found and scheduled in such a manner so as to impose a minimum amount of delay on the performance of other operations and the algorithm in general.
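By way of illustration only, one simple way to assign operations to stages according to their dependencies is an as-soon-as-possible ordering, in which each operation is placed one stage after the latest operation it depends upon. The sketch below applies this to a hypothetical decomposition of a single iteration of the equation; the operation names and the dependency graph are assumptions made for the example and do not represent the scheduling logic of any particular embodiment. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical operations for one iteration, mapped to the operations
# whose results they need. Names are illustrative only.
deps = {
    "x_plus_k": set(),               # K + X
    "f_bcd": set(),                  # F applied to B, C, D
    "rot5_a": set(),                 # A rotated left by 5 bits
    "plus_e": {"x_plus_k"},          # (K + X) + E
    "plus_f": {"plus_e", "f_bcd"},   # ((K + X) + E) + F
    "temp": {"plus_f", "rot5_a"},    # final sum producing TEMP
}


def asap_stages(deps):
    """Place each operation one stage after its latest dependency."""
    stage = {}
    for op in TopologicalSorter(deps).static_order():
        stage[op] = 1 + max((stage[p] for p in deps[op]), default=0)
    return stage


stages = asap_stages(deps)
# The longest dependency chain sets the minimum number of stages.
critical_path_length = max(stages.values())

for op, s in sorted(stages.items(), key=lambda item: item[1]):
    print(f"stage {s}: {op}")
print("critical path length:", critical_path_length)
```

In this example the chain from `x_plus_k` through `temp` is the critical path; operations off that chain, such as `rot5_a`, have slack and could be placed in a later stage without lengthening the schedule.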

Many of the scheduled operations may then be performed by the various pipeline stages in parallel at operation 415 if they have few or no dependencies from earlier operations.

In addition to performing the operations corresponding to each iteration of the main compression loop, the initial and final states of the chaining variables must be added to each other in order to produce the final output state at operation 420.

Embodiments of the invention may be performed using logic consisting of standard complementary metal-oxide-semiconductor (“CMOS”) devices (hardware) or by using instructions (software) stored upon a machine-readable medium, which when executed by a machine, such as a processor, cause the machine to perform a method to carry out the steps of an embodiment of the invention. Alternatively, a combination of hardware and software may be used to carry out embodiments of the invention.

While the invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments, which are apparent to persons skilled in the art to which the invention pertains are deemed to lie within the spirit and scope of the invention.

1. A processor comprising: a plurality of pipeline stages to perform an inner loop of a hash algorithm, the plurality of pipeline stages comprising at least as many pipeline stages as there are iterations of the inner loop to be performed.
2. The processor of claim 1 wherein the plurality of pipeline stages further comprises as many pipeline stages as there are chaining variables to be used in the inner loop.
3. The processor of claim 2 wherein each pipeline stage comprises an adder, a shifter, and logic to perform a function.
4. The processor of claim 3 further comprising control logic to schedule operations to be executed within the plurality of pipeline stages.
5. The processor of claim 4 wherein operations are to be scheduled by the control logic and executed by the plurality of pipeline stages so as to minimize data dependencies between iterations of the inner loop to be performed.
6. The processor of claim 5 wherein the hash algorithm is chosen from a group of secure hash algorithms (SHA) consisting of SHA-1, SHA-128, SHA-196, SHA-256, and message digest 5 (MD5).
7. The processor of claim 6 wherein the hash algorithm is to be performed at an operating frequency equal to that of the adder.
8. The processor of claim 7 wherein the plurality of pipeline stages comprises 88 pipeline stages to process 512 bits of data.
9. An apparatus comprising: a first plurality of pipeline stages to perform a hash including: a first pipeline stage to add a first constant to a first data word to yield a first result; a second pipeline stage to add the first result to a first chaining variable, perform a first function on a second, third, and fourth chaining variable to yield a second result, and add the first constant to a second data word to yield a third result; a third pipeline stage to add the second result to the sum of a fifth chaining variable and the first result, add the first constant to a third data word, add the third result to the fourth chaining variable, perform the first function on the first, second, and third chaining variables after each of them has been shifted by a plurality of bits; a second plurality of pipeline stages to add an initial state of the first, second, third, fourth, and fifth chaining variables to a final state of the first, second, third, fourth, and fifth chaining variables, respectively.
10. The apparatus of claim 9 wherein the first plurality of pipeline stages comprises 83 pipeline stages to process 512 bits of information.
11. The apparatus of claim 9 wherein the second plurality of pipeline stages comprises 5 pipeline stages to process 512 bits of information.
12. The apparatus of claim 9 wherein the first and second plurality of pipeline stages are implemented within a network processor architecture.
13. The apparatus of claim 9 wherein the hash algorithm is a secure hash algorithm (SHA) and the plurality of bits is 30.
14. The apparatus of claim 9 wherein the network processor architecture is to perform the hash algorithm at an operating frequency of at least 1.4 GHz.
15. A machine-readable medium having stored thereon a set of instructions, which if executed by a machine cause the machine to perform a method comprising: performing a plurality of iterations of an inner loop of a hash algorithm in parallel, the plurality of iterations performed in parallel being limited, at least in part, by dependencies between each of the plurality of iterations of the inner loop; adding initial values of a plurality of chaining variables to final values of the plurality of chaining variables, the final values being a result of performing the plurality of iterations of the inner loop.
16. The machine-readable medium of claim 15 wherein the method further comprises controlling scheduling of operations performed as a result of performing the plurality of iterations of the inner loop, the scheduling being controlled so as to minimize a critical path among the operations.
17. The machine-readable medium of claim 16 wherein the critical path depends upon the dependencies between the plurality of iterations of the inner loop.
18. The machine-readable medium of claim 17 wherein the method further comprises decoding the inner loop of the hash algorithm into a first number of operational stages, the first number of operational stages being equal to at least the plurality of iterations.
19. The machine-readable medium of claim 18 wherein the inner loop is to be performed to process a first number of data elements transmitted over a network.
20. The machine-readable medium of claim 19 wherein the first number of operational stages is at least 83 and the first number of data elements comprises 512 bits.
21. A method comprising: performing a hash algorithm within a pipelined processor by performing a plurality of iterations of an inner loop of the hash algorithm in parallel; generating a plurality of output data elements as a result of performing the hash algorithm.
22. The method of claim 21 further comprising scheduling operations associated with the plurality of iterations so as to facilitate a maximum number of the operations to be performed in parallel.
23. The method of claim 22 wherein the maximum number depends upon dependencies between the operations.
24. The method of claim 22 wherein the output data elements are transmitted within a computer network.
25. The method of claim 24 wherein the hash algorithm is performed at substantially the same frequency as the operating frequency of the processor.
26. The method of claim 25 wherein the hash algorithm is performed at approximately 1.4 GHz.
27. A system comprising: a memory unit to store operations of a hash algorithm; a pipelined processor to perform the operations of the hash algorithm by performing iterations of an inner loop of the hash algorithm within separate pipeline stages of the pipelined processor.
28. The system of claim 27 wherein the operations are scheduled so as to minimize the number of dependencies among the operations.
29. The system of claim 28 further comprising a bus upon which to drive data generated by performing the hash algorithm within the pipelined processor.
30. The system of claim 28 further comprising a bus to receive data to be operated on by the pipelined processor to perform the hash algorithm.
31. The system of claim 30 wherein 512 bits of data is to be processed by at least 83 pipeline stages of the pipelined processor.
32. The system of claim 27 wherein the pipelined processor is a network processor coupled to a network.
33. The system of claim 32 further comprising a host processor coupled to the network processor to perform a portion of the hash algorithm.
34. The system of claim 27 wherein the hash algorithm is chosen from a group of secure hash algorithms (SHA) consisting of SHA-1, SHA-128, SHA-196, SHA-256, and message digest 5 (MD5).
35. An apparatus comprising: execution means for performing iterations of an inner loop of a hash algorithm in parallel including: first means for adding a first constant to a first data word to yield a first result; second means for adding the first result to a first chaining variable, performing a first function on a second, third, and fourth chaining variable to yield a second result, and adding the first constant to a second data word to yield a third result; third means for adding the second result to the sum of a fifth chaining variable and the first result, adding the first constant to a third data word, adding the third result to the fourth chaining variable, performing the first function on the first, second, and third chaining variables after each of them has been shifted by a plurality of bits; adding means for adding an initial state of the first, second, third, fourth, and fifth chaining variables to a final state of the first, second, third, fourth, and fifth chaining variables, respectively; scheduling means for scheduling operations associated with the hash algorithm.
36. The apparatus of claim 35 wherein the execution means is a pipelined architecture and wherein each of the first, second, and third means are pipeline stages of the pipelined architecture.
37. The apparatus of claim 35 wherein the scheduling means is a controller to schedule operations associated with the inner loop according to dependencies among the operations.
38. The apparatus of claim 36 wherein each iteration of the inner loop requires three pipeline stages to perform the iteration.
39. The apparatus of claim 38 wherein the adding means comprises the same number of pipeline stages as chaining variables.
40. The apparatus of claim 35 wherein the hash algorithm is chosen from a group of secure hash algorithms (SHA) consisting of SHA-1, SHA-128, SHA-196, SHA-256, and message digest 5 (MD5).
41. The apparatus of claim 35 wherein the plurality of bits is 30.